Emotion AI¶
Nutshell¶
In this project I build a program that classifies emotions from images of human faces, following the course Modern Artificial Intelligence, lectured by Dr. Ryan Ahmed, Ph.D., MBA.
The data set I use is from https://www.kaggle.com/c/facial-keypoints-detection/overview and consists of over 20,000 facial images labeled with a facial expression/emotion, and approximately 2,000 images with keypoint annotations.
The program trains two models:
- one that detects facial keypoints
- one that detects emotions.
These models are then combined into a single model that outputs both the keypoints and the emotion.
A short recap of artificial neural networks¶
Artificial neurons are built in a similar way to human neurons. An artificial neuron takes in signals through input channels (the dendrites in a biological neuron), processes the information through a transfer function (the cell body) and generates an output (which in a neuronal cell would travel through the axon).
Fig. 1. Side by side view of artificial and biological neurons. Credit: Top image from Introduction to Psychology (A critical approach) Copyright © 2021 by Rose M. Spielman; Kathryn Dumper; William Jenkins; Arlene Lacombe; Marilyn Lovett; and Marion Perlmutter licensed under a Creative Commons Attribution 4.0 International License. Bottom image Chrislb, CC BY-SA 3.0 , via Wikimedia Commons
For example, let's consider an artificial neuron (AN) that takes three inputs: $x_1$, $x_2$, and $x_3$. We can then express the output of the artificial neuron mathematically as $y = \phi(x_1w_1 + x_2w_2 + x_3w_3 + b)$. Here $y$ is the output and the $w$s are the weights assigned to each input signal. $b$ is a bias term added to the weighted sum of inputs. $\phi$ is the activation function.
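To make the formula concrete, the three-input neuron can be sketched in a few lines of NumPy (the input values, weights and bias below are made up for illustration):

```python
import numpy as np

def neuron(x, w, b, phi):
    # Weighted sum of inputs plus bias, passed through the activation phi
    return phi(np.dot(w, x) + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # one possible activation

x = np.array([0.5, -1.0, 2.0])  # made-up inputs x1, x2, x3
w = np.array([0.4, 0.3, 0.1])   # made-up weights w1, w2, w3
b = 0.1                         # bias
y = neuron(x, w, b, sigmoid)    # a single number between 0 and 1
```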
Some common modern activation functions used in neural networks are, for example, ReLU, GELU and the logistic activation function. ReLU is short for Rectified Linear Unit and is defined as $\phi(x) = \max(0, x)$. ReLU is recommended for the hidden layers, since it outputs a linear response for positive values. This helps maintain larger gradients and makes training deep networks more feasible.
The Gaussian Error Linear Unit (GELU) is a smoother version of the ReLU and is defined as $\phi(x) = x\,\Phi(x)$, where $\Phi(x)$ stands for the Gaussian cumulative distribution function.
The logistic activation function, also called the sigmoid function, is defined as $\phi(x) = \frac{1}{1+e^{-x}}$. It squashes any number into the range 0 to 1 and is thus very useful in output layers.
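All three activations are simple enough to write out directly. A sketch in NumPy, using the exact error-function form of the Gaussian CDF for GELU:

```python
import numpy as np
from math import erf, sqrt

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity for positive
    return np.maximum(0.0, x)

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written via the error function
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))
```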
Training¶
All neural networks need to be trained with labeled data. The available data is generally divided into 80% training and 20% testing data. It is also recommended to further split the training portion into an actual training set (e.g. 60% of all data) and a validation set (e.g. 20%).
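One way to sketch such a 60/20/20 split with plain NumPy (the sample count is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000
indices = rng.permutation(n_samples)  # shuffle before splitting

train_idx = indices[:int(0.6 * n_samples)]                      # 60% training
val_idx   = indices[int(0.6 * n_samples):int(0.8 * n_samples)]  # 20% validation
test_idx  = indices[int(0.8 * n_samples):]                      # 20% testing
```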
Training is done by adjusting the weights of the network, iteratively minimising the cost function with, for example, the gradient descent optimisation algorithm. It calculates the gradient of the cost function and then takes a step in the negative direction, repeating until it reaches a local or the global minimum.
A typical choice for a cost function is the quadratic loss, which is formulated as $f_{loss}(w,b)= \frac{1}{N}\sum^N_{i=1}(\hat y_i-y_i)^2$.
Gradient descent algorithm:
1. Calculate the derivative of the loss function, $\frac{\partial f_{loss}}{\partial w}$.
2. Pick random initial values for the weights and substitute them in.
3. Calculate the step size, i.e. how much we will update our weights:
step size = learning rate * gradient $=\alpha\cdot\frac{\partial f_{loss}}{\partial w}$
4. Update the parameters and repeat:
new weight = old weight - step size, $w_{new}=w_{old}-\alpha\cdot\frac{\partial f_{loss}}{\partial w}$
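The steps above can be sketched on a toy one-parameter model $\hat y = wx$ with the quadratic loss (the data and the learning rate are made up for illustration):

```python
import numpy as np

# Toy data generated from y = 2x, so the optimal weight is w = 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.5       # step 2: an arbitrary starting value for the weight
alpha = 0.05  # learning rate

for _ in range(100):
    # step 1: derivative of (1/N) * sum((w*x - y)^2) with respect to w
    grad = (2.0 / len(x)) * np.sum((w * x - y) * x)
    # steps 3 and 4: step size = alpha * gradient, then update the weight
    w = w - alpha * grad
```

With this learning rate the error shrinks by a constant factor each iteration, so `w` converges quickly to 2.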
Below is an example of searching for the minimum of a U-shaped function with gradient descent. Usually the situation is multidimensional, but the simplified case is solved in the same way.
Testing various learning rates helps to understand the importance of choosing the training parameters.
As shown above, too large a learning rate can lead to missing the minimum and/or the model not converging quickly. Equally problematic is too small a learning rate, where the model barely learns. To solve the problems arising from too small or too large learning rates, there are several approaches that adjust the learning rate dynamically.
Momentum is analogous to a ball's tendency to keep rolling downhill. Momentum is used to speed up learning when the gradient of the error cost keeps pointing in the same direction for a long time, and to slow down when a level area is reached. Momentum is controlled by a parameter analogous to the mass of the rolling ball. A large momentum helps avoid getting stuck in local minima, but might also push through the minimum we wish to find. Thus, the parameter has to be selected carefully.
Learning rates can also be adjusted through decay, which reduces the learning rate by a certain amount after a fixed number of epochs. This helps in situations like the one above, where too large a learning rate makes the training jump back and forth over a minimum.
Adagrad and Adam are examples of popular adaptive optimisation algorithms for gradient descent.
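As an illustration of the momentum idea (not the Adagrad/Adam update rules themselves), here is a heavy-ball sketch on the same kind of toy problem as before; all numbers are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                # toy data from y = 2x, optimal weight is w = 2

w, velocity = 0.0, 0.0
alpha, beta = 0.01, 0.9    # learning rate and momentum coefficient

for _ in range(500):
    grad = (2.0 / len(x)) * np.sum((w * x - y) * x)
    # The velocity accumulates past gradients, like a rolling ball gathering speed
    velocity = beta * velocity + alpha * grad
    w = w - velocity
```

When successive gradients point the same way, the velocity term builds up and the steps grow; when the gradient flattens out, the velocity decays by the factor `beta` each step.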
Network architectures¶
The artificial neurons are connected to each other to form neural networks and a plethora of different network architectures exist. To harness the power of AI, it is necessary to know which architecture serves the intended purpose best. Below are three common architectures and their applications.
Recurrent Neural Networks (RNNs) handle sequential data by maintaining a hidden state that captures information about previous elements in the sequence. Therefore they are great for contexts where the output depends on previous inputs, for example time series and natural language processing.
Generative Adversarial Networks (GANs) consist of two neural networks, the Generator and the Discriminator. They spar with each other in a zero-sum game framework: the generator creates synthetic data that resembles real data and the discriminator evaluates whether it is real or not. This drives the generator to output increasingly realistic data. Naturally, this is the architecture of choice for much image generation and editing, but also for anomaly detection in industrial and security contexts: GANs can model regular patterns and subsequently detect anomalies by comparing generated outputs with real inputs.
Convolutional Neural Networks (CNN) are designed to process data with a grid-like topology and are most commonly used in image analysis. They utilise convolutional layers to learn spatial hierarchies by applying filters (kernels) that slide (convolve) over the input. They usually involve pooling layers that reduce the spatial dimensions and fully connected layers that map the extracted features to outputs.
Fig. 2. Convolutional neural network. Credit: Aphex34, CC BY-SA 4.0, via Wikimedia Commons
In the Emotion AI, I will use a Residual Neural Network (ResNet). ResNet's architecture includes "skip connections", which enable training very deep networks without vanishing gradient issues. The vanishing gradient problem occurs when the gradient is back-propagated to earlier layers and becomes very small along the way. A skip connection works by passing the input of one layer to a layer further down in the network. This is also called identity mapping. The ResNet model that I use has been pretrained on the ImageNet dataset.
Fig. 3. Identity mapping. Credit: LunarLullaby, CC BY-SA 4.0, via Wikimedia Commons
Part 1. Key facial points detection¶
In this section I program the DL model with convolutional neural network and residual blocks to predict facial keypoints. The data set is from https://www.kaggle.com/c/facial-keypoints-detection/overview.
The dataset consists of input images, each annotated with 15 facial key points. The training.csv file has 7049 face images with corresponding keypoint locations. The test.csv file has face images only, and will be used to test the model. The Image column has shape (2140,): each entry is a single string of space-separated pixel values that has to be transformed into the real shape of the images, (96, 96). Thus we create a 1-D array from the string and reshape it to a 2-D array.
The model I build will have the architecture presented below. The Resblock consists of two different types of blocks: a convolution block and an identity block. As seen below, both of them have an additional short path that adds the original input to the output. For the convolution block this includes a few extra steps to shape the input to the same dimensions as the output of the longer path.
key_points_df['Image'].shape
key_points_df['Image'][0]
type(key_points_df['Image'][0])
key_points_df['Image'] = key_points_df['Image'].apply(lambda img: np.fromstring(img, dtype=int, sep=' ').reshape(96, 96))
key_points_df['Image'][0].shape
(96, 96)
key_points_df.describe()
|   | left_eye_center_x | left_eye_center_y | right_eye_center_x | right_eye_center_y | left_eye_inner_corner_x | left_eye_inner_corner_y | left_eye_outer_corner_x | left_eye_outer_corner_y | right_eye_inner_corner_x | right_eye_inner_corner_y | ... | nose_tip_x | nose_tip_y | mouth_left_corner_x | mouth_left_corner_y | mouth_right_corner_x | mouth_right_corner_y | mouth_center_top_lip_x | mouth_center_top_lip_y | mouth_center_bottom_lip_x | mouth_center_bottom_lip_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | ... | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 | 2140.000000 |
| mean | 66.221549 | 36.842274 | 29.640269 | 37.063815 | 59.272128 | 37.856014 | 73.412473 | 37.640110 | 36.603107 | 37.920852 | ... | 47.952141 | 57.253926 | 63.419076 | 75.887660 | 32.967365 | 76.134065 | 48.081325 | 72.681125 | 48.149654 | 82.630412 |
| std | 2.087683 | 2.294027 | 2.051575 | 2.234334 | 2.005631 | 2.034500 | 2.701639 | 2.684162 | 1.822784 | 2.009505 | ... | 3.276053 | 4.528635 | 3.650131 | 4.438565 | 3.595103 | 4.259514 | 2.723274 | 5.108675 | 3.032389 | 4.813557 |
| min | 47.835757 | 23.832996 | 18.922611 | 24.773072 | 41.779381 | 27.190098 | 52.947144 | 26.250023 | 24.112624 | 26.250023 | ... | 24.472590 | 41.558400 | 43.869480 | 57.023258 | 9.778137 | 56.690208 | 32.260312 | 56.719043 | 33.047605 | 57.232296 |
| 25% | 65.046300 | 35.468842 | 28.472224 | 35.818377 | 58.113054 | 36.607950 | 71.741978 | 36.102409 | 35.495730 | 36.766783 | ... | 46.495330 | 54.466000 | 61.341291 | 72.874263 | 30.879288 | 73.280038 | 46.580004 | 69.271669 | 46.492000 | 79.417480 |
| 50% | 66.129065 | 36.913319 | 29.655440 | 37.048085 | 59.327154 | 37.845220 | 73.240045 | 37.624207 | 36.620735 | 37.920336 | ... | 47.900511 | 57.638582 | 63.199057 | 75.682465 | 33.034022 | 75.941985 | 47.939031 | 72.395978 | 47.980854 | 82.388899 |
| 75% | 67.332093 | 38.286438 | 30.858673 | 38.333884 | 60.521492 | 39.195431 | 74.978684 | 39.308331 | 37.665280 | 39.143921 | ... | 49.260657 | 60.303524 | 65.302398 | 78.774969 | 35.063575 | 78.884031 | 49.290000 | 75.840286 | 49.551936 | 85.697976 |
| max | 78.013082 | 46.132421 | 42.495172 | 45.980981 | 69.023030 | 47.190316 | 87.032252 | 49.653825 | 47.293746 | 44.887301 | ... | 65.279654 | 75.992731 | 84.767123 | 94.673637 | 50.973348 | 93.443176 | 61.804506 | 93.916338 | 62.438095 | 95.808983 |
8 rows × 30 columns
We perform a sanity check for the data by visualising 64 randomly chosen images along with their key facial points.
Image augmentation¶
Here we create an additional data set in which the images are changed slightly, to improve the generalisation of the final AI model. We want more data and more variability in e.g. orientation, lighting conditions and size of the images. This reduces the likelihood of overfitting and ensures that the model learns the meaningful "concepts" of emotion recognition. We create this extra data set by copying the original data set and tweaking the copies.
I will create 4 types of augmented images:
- horizontal flipping
- randomly increased brightness
- vertical flipping
- rotation by a random angle
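Two of these augmentations can be sketched with plain NumPy; the image, the keypoint and the brightness range are made up, and note that a flip must also mirror the keypoint x-coordinates (the exact offset convention may differ by a pixel):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((96, 96))   # stand-in for one normalized face image
kp_x, kp_y = 30.0, 40.0      # one made-up keypoint (x, y) in pixel coordinates

# Horizontal flip: mirror the columns and mirror the x-coordinate of the keypoint
flipped = img[:, ::-1]
flipped_kp_x = 96.0 - kp_x   # one common convention

# Random brightness increase: scale pixel values up and clip back to the valid range
bright = np.clip(img * rng.uniform(1.0, 1.5), 0.0, 1.0)
```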
(4280, 31)
(6420, 31)
(8560, 31)
(10700, 31)
Data normalization and scaling¶
I normalize the image pixel values to the range 0 to 1. This generally produces better results in neural networks.
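A minimal sketch of the normalization step, on a fake batch of 8-bit grayscale images standing in for the real data:

```python
import numpy as np

# Fake batch of four 8-bit grayscale images (values 0-255)
imgs = np.random.default_rng(1).integers(0, 256, size=(4, 96, 96)).astype(np.float32)

imgs_norm = imgs / 255.0                   # scale pixel values into [0, 1]
imgs_norm = np.expand_dims(imgs_norm, -1)  # add the channel axis the network expects
```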
# Obtain the x and y coordinates to be used as target
img_target = augmented_df[:,:30]
img_target = np.asarray(img_target).astype(np.float32)
img_target.shape
(10700, 30)
# Split the data into train and test data
X_train_kp, X_test_kp, y_train_kp, y_test_kp = train_test_split(img_array, img_target, test_size=0.2, random_state=42)
X_train_kp.shape
(8560, 96, 96, 1)
X_test_kp.shape
(2140, 96, 96, 1)
y_test_kp.shape
(2140, 30)
y_train_kp.shape
(8560, 30)
Building the Residual Neural Network model for key facial points detection¶
Kernels modify the input by sweeping over it, as shown in this animation:
Fig. 4 Performing a convolution on 6x6 input with a 3x3 kernel using stride 1x1. Credit: Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons.
For example, we could perform a 2D convolution for our input with this command:
X = Conv2D(filters=64, kernel_size=(7,7), strides=(2,2), kernel_initializer = glorot_uniform(seed=0))(X_input)
Here we tell the function that we want to
- use 64 distinct filters (each one is a trainable 7×7 “weight grid”).
- use stride 2x2, i.e., the filter jumps 2 pixels at a time, effectively “skipping” every other location.
- initialise the kernels with the glorot_uniform method, aka Xavier uniform initialization. This draws samples from a uniform distribution within a specific range, which is determined from the number of input and output units.
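We can check what such a convolution does to the spatial dimensions with the standard output-size formula:

```python
def conv_out_size(n, kernel, stride, pad=0):
    # floor((n + 2*pad - kernel) / stride) + 1
    return (n + 2 * pad - kernel) // stride + 1

# A 96x96 input zero-padded by 3 on each side, convolved with a 7x7 kernel
# at stride 2, comes out as 48x48, which matches the conv1 row of the model summary
out = conv_out_size(96, kernel=7, stride=2, pad=3)  # 48
```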
In this section I define the model architecture using Keras. Below is the code to generate Resblocks.
# @title Resblock
def res_block(X, filter, stage):
    """
    Implementation of the Resblock.
    Arguments:
    X -- input tensor
    filter -- tuple/list of integers, the number of filters for each conv layer (f1, f2, f3)
    stage -- string, used to name the layers uniquely
    Returns:
    X -- output of the res block
    """
    ### 1: Convolutional block ###
    # Make a copy of the input
    X_shortcut = X
    f1, f2, f3 = filter
    # ---- Long (main) path ----
    # Conv2D
    X = Conv2D(f1, kernel_size=(1, 1), strides=(1, 1), name=str(stage) + 'convblock' + '_conv_a',
               kernel_initializer=glorot_uniform(seed=0))(X)
    # MaxPool2D
    X = MaxPool2D(pool_size=(2, 2))(X)
    # BatchNorm, ReLU
    X = BatchNormalization(axis=3, name=str(stage) + 'convblock' + '_bn_a')(X)
    X = Activation('relu')(X)
    # Conv2D (kernel 3x3)
    X = Conv2D(f2, kernel_size=(3, 3), strides=(1, 1), padding='same', name=str(stage) + 'convblock' + '_conv_b',
               kernel_initializer=glorot_uniform(seed=0))(X)
    # BatchNorm, ReLU
    X = BatchNormalization(axis=3, name=str(stage) + 'convblock' + '_bn_b')(X)
    X = Activation('relu')(X)
    # Conv2D
    X = Conv2D(f3, kernel_size=(1, 1), strides=(1, 1), name=str(stage) + 'convblock' + '_conv_c',
               kernel_initializer=glorot_uniform(seed=0))(X)
    # BatchNorm
    X = BatchNormalization(axis=3, name=str(stage) + 'convblock' + '_bn_c')(X)
    # ---- Short path ----
    # Conv2D
    X_shortcut = Conv2D(f3, kernel_size=(1, 1), strides=(1, 1), name=str(stage) + 'convblock' + '_conv_short',
                        kernel_initializer=glorot_uniform(seed=0))(X_shortcut)
    # MaxPool2D and BatchNorm
    X_shortcut = MaxPool2D(pool_size=(2, 2))(X_shortcut)
    X_shortcut = BatchNormalization(axis=3, name=str(stage) + 'convblock' + '_bn_short')(X_shortcut)
    # ---- Add paths together ----
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    ### 2: Identity block 1 ###
    # Save the input value (shortcut path)
    X_shortcut = X
    block = 'iden1'
    # First component: Conv2D -> BatchNorm -> ReLU
    X = Conv2D(f1, (1, 1), strides=(1, 1), name=str(stage) + block + '_conv_a',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=str(stage) + block + '_bn_a')(X)
    X = Activation('relu')(X)
    # Second component: Conv2D (3x3) -> BatchNorm -> ReLU
    X = Conv2D(f2, (3, 3), strides=(1, 1), padding='same', name=str(stage) + block + '_conv_b',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=str(stage) + block + '_bn_b')(X)
    X = Activation('relu')(X)
    # Third component: Conv2D (1x1) -> BatchNorm
    X = Conv2D(f3, (1, 1), strides=(1, 1), name=str(stage) + block + '_conv_c',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=str(stage) + block + '_bn_c')(X)
    # Add shortcut value to the main path
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    ### 3: Identity block 2 ###
    # Save the input value (shortcut path)
    X_shortcut = X
    block = 'iden2'
    # First component: Conv2D -> BatchNorm -> ReLU
    X = Conv2D(f1, (1, 1), strides=(1, 1), name=str(stage) + block + '_conv_a',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=str(stage) + block + '_bn_a')(X)
    X = Activation('relu')(X)
    # Second component: Conv2D (3x3) -> BatchNorm -> ReLU
    X = Conv2D(f2, (3, 3), strides=(1, 1), padding='same', name=str(stage) + block + '_conv_b',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=str(stage) + block + '_bn_b')(X)
    X = Activation('relu')(X)
    # Third component: Conv2D (1x1) -> BatchNorm
    X = Conv2D(f3, (1, 1), strides=(1, 1), name=str(stage) + block + '_conv_c',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=str(stage) + block + '_bn_c')(X)
    # Add shortcut value to the main path
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)
    return X
Now that the Resblock is defined we can build the final model.
# @title Final Resnet Neural Network model
input_shape = (96,96,1)
# Input tensor shape
X_input = Input(input_shape)
# Zero-padding
X = ZeroPadding2D((3,3))(X_input)
# Stage 1
X = Conv2D(filters = 64, kernel_size = (7,7), strides = (2,2), name='conv1', \
kernel_initializer = glorot_uniform(seed=0))(X)
X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
X = Activation('relu')(X)
X = MaxPooling2D((3,3), strides = (2,2))(X)
# Stage 2
X = res_block(X, filter = [64, 64, 256], stage = 'res1')
# Stage 3
X = res_block(X, filter = [128,128,512], stage = 'res2')
# We could also add more resblocks if we want
# X = res_block(X, filter= [256,256,1024], stage= 'res3')
# Average pooling
X = AveragePooling2D((2,2), name = 'avg_pool')(X)
# Flatten
X = Flatten()(X)
# Dense, ReLU, Dropout
X = Dense(4096, activation = 'relu')(X)
X = Dropout(0.2)(X)
X = Dense(2048, activation = 'relu')(X)
X = Dropout(0.1)(X)
X = Dense(30, activation = 'relu')(X)
model_1_facialKeyPoints = Model(inputs = X_input, outputs = X)
Model: "functional" (summary abridged; the individual res-block layers are condensed into one row per stage)

| Layer (type) | Output Shape | Param # |
|---|---|---|
| input_layer (InputLayer) | (None, 96, 96, 1) | 0 |
| zero_padding2d (ZeroPadding2D) | (None, 102, 102, 1) | 0 |
| conv1 (Conv2D) | (None, 48, 48, 64) | 3,200 |
| bn_conv1 (BatchNormalization) | (None, 48, 48, 64) | 256 |
| max_pooling2d (MaxPooling2D) | (None, 23, 23, 64) | 0 |
| res1: convolution block + 2 identity blocks | (None, 11, 11, 256) | 220,032 |
| res2: convolution block + 2 identity blocks | (None, 5, 5, 512) | 947,968 |
| avg_pool (AveragePooling2D) | (None, 2, 2, 512) | 0 |
| flatten (Flatten) | (None, 2048) | 0 |
| dense (Dense) + dropout | (None, 4096) | 8,392,704 |
| dense_1 (Dense) + dropout_1 | (None, 2048) | 8,390,656 |
| dense_2 (Dense) | (None, 30) | 61,470 |

Total params: 18,016,286 (68.73 MB)
Trainable params: 18,007,710 (68.69 MB)
Non-trainable params: 8,576 (33.50 KB)
Explanations of components¶
The ZeroPadding2D layer adds a border of zeros (3 pixels wide) around the input image. This prevents information loss at the edges during convolutions.
Conv2D is the cake base of the convolutional stack. It applies filters to the input image, sliding them across it with a set stride; this is how features are extracted from the image.
The BatchNormalisation layer normalizes the output of the convolution, making training more stable. We can say it is the smooth cream layer on our convolution cake.
The ReLU activation function introduces non-linearity to the model.
MaxPooling2D reduces the spatial dimensions of the feature maps by taking the maximum value in each window, and so downsamples the output. After the res blocks, AveragePooling2D is used; it works like MaxPooling2D except that it takes the average value within the window, and it likewise reduces the size of the feature maps. To give an impression of the impact of pooling: if we removed the MaxPooling2D layers from the res blocks, the final model would have 256 million parameters instead of 18 million.
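To make the pooling operation concrete, here is a minimal NumPy sketch (an illustration of the idea, not the Keras implementation) of 2×2 max and average pooling with stride 2:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2-D array with a square window; mode is 'max' or 'avg'."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            win = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = win.max() if mode == "max" else win.mean()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```

Either way the 4×4 input shrinks to 2×2, which is exactly why removing pooling layers inflates the parameter count of the later dense layers so dramatically.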
Flatten converts the multi-dimensional feature maps into a single, long vector, preparing the data for the fully connected layers.
Dense creates a fully connected layer where each neuron is connected to every neuron in the previous layer. These fully connected layers process the features extracted by the convolutional layers.
Dropout layers are a regularisation technique that drops a set percentage of the neurons during training by setting them to zero. This makes the model less likely to overfit and decreases the interdependency between the neurons, which improves the performance and generalisability of the network.
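The mechanism can be illustrated in a few lines of NumPy. This is a sketch of "inverted dropout", where surviving activations are rescaled so the expected activation is unchanged; it is an illustration, not Keras's exact implementation:

```python
import numpy as np

def dropout(x, rate=0.5, training=True, rng=None):
    """Zero out a fraction `rate` of activations and rescale the rest."""
    if not training:
        return x                            # inference: identity
    if rng is None:
        rng = np.random.default_rng(0)      # seeded for reproducibility
    mask = rng.random(x.shape) >= rate      # keep each unit with prob 1-rate
    return x * mask / (1.0 - rate)          # rescale surviving activations

a = np.ones(1000)
out = dropout(a, rate=0.3)
print(out.mean())  # close to 1.0: expectation is preserved by the rescaling
```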
The final model has a very complex structure with 18 million trainable parameters, which allows it to learn to identify emotions as well as, or even better than, an average human. However, too many parameters can lead to problems, such as overfitting and slow or non-converging training. Optimising this many parameters is not a trivial task.
Compiling and training the model¶
I will use the Adam optimization method for the training. Adam is a computationally efficient stochastic gradient method that combines gradient descent with momentum and the RMSP algorithm.
As discussed earlier, momentum speeds up training by adding a fraction of the previous gradient to the current one. RMSP, or Root Mean Square Propagation, is an adaptive learning-rate algorithm that takes an exponential moving average of the squared gradients. In other words, it adapts the learning rate for each parameter by keeping track of an exponentially decaying average of past squared gradients.
The algorithm proceeds as follows:
1. Calculate the gradient $g_t$
$g_t = \frac{\partial L }{\partial w_t}$
2. Update the Biased first moment estimate $m_t$
$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$
This is similar to calculating the momentum as we keep track of the decaying average of past gradients.
3. Update the Biased Second Moment Estimate $v_t$
$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$
This is similar to RMSP as we keep track of an exponentially decaying average of past squared gradients.
4. Bias correction for $m_t$ and $v_t$
Especially at the beginning of training, $m_t$ and $v_t$ are biased toward zero (because they are initialised at zero). Adam corrects for this as follows:
$\hat m_t = \frac{m_t}{1-\beta_1^t}$, $\hat v_t = \frac{v_t}{1-\beta_2^t}$
5. Parameter update
$w_{t} = w_{t-1} - \alpha_t\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}$
where,
$g_t$ = gradient of the loss with respect to the parameters at iteration $t$
$\alpha_t$ = learning rate at iteration $t$
$\beta_1, \beta_2$ = decay rates for the moment estimates
$\epsilon$ = small constant to prevent division by zero
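The five steps above can be sketched directly in NumPy. This is a minimal illustration minimising $f(w) = w^2$ with the common default values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, not a replacement for the TensorFlow optimizer:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                     # step 2: first moment
    v = b2 * v + (1 - b2) * g**2                  # step 3: second moment
    m_hat = m / (1 - b1**t)                       # step 4: bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # step 5: parameter update
    return w, m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):                           # t starts at 1 for the bias terms
    g = 2 * w                                     # step 1: gradient of w^2
    w, m, v = adam_step(w, g, m, v, t)
print(w)  # close to the minimum at 0
```

Note how the bias correction matters most for small $t$, when $1-\beta^t$ is far from 1.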
The TensorFlow implementation of the Adam optimizer accepts several arguments as input:
- learning_rate: a float, or a scheduler that adjusts the learning rate during training.
- beta_1: a float (value or constant tensor) giving the exponential decay rate for the 1st moment estimates, i.e. the means of the gradients. Default = 0.9.
- beta_2: a float (value or constant tensor) giving the exponential decay rate for the 2nd moment estimates, i.e. the uncentered variances of the gradients. Default = 0.999.
- amsgrad: True/False. Whether to apply the AMSGrad variant of the algorithm, presented in the paper On the Convergence of Adam and Beyond. Default = False.
- weight_decay: if set, applies the given weight decay to the parameters.
Other things to consider when optimising¶
The batch size determines how many training examples are processed before the model's internal parameters are updated. Smaller batch sizes can speed up the training per epoch because the model updates more frequently. However, this can lead to less stable convergence, i.e. the training loss may fluctuate more. A small batch size can be beneficial in case the model is overfitting (the training loss is significantly lower than the validation loss).
A larger batch size leads to slower training per epoch and requires more memory, but can yield more stable updates for the parameters. The model usually converges more smoothly, but might not generalise as well due to "sharp minima".
Another way to tune the optimization is to use learning rate schedulers. Why? As training progresses, the model gets closer to a good solution. Smaller learning rates allow for finer adjustments to the model's weights, helping it converge to a better minimum without overshooting (see the gradient descent examples in the beginning). I have implemented a learning rate scheduler that reduces the learning rate if the validation loss does not improve within 5 epochs.
After training, the model is saved to a .keras file, which is a zip archive that contains:
- The architecture
- The weights
- The optimizer's status
# @title Compiling and training with 3 epochs
run_example = False
if run_example:
    adam = tf.keras.optimizers.Adam(learning_rate=0.0001, beta_1=0.9,
                                    beta_2=0.999, amsgrad=False)
    model_3_facialKeyPoints = Model(inputs=X_input, outputs=X)
    model_3_facialKeyPoints.compile(loss="mean_squared_error", optimizer=adam,
                                    metrics=['accuracy'])
    # Save the best model (lowest validation loss) here
    checkpoint = ModelCheckpoint(filepath="Models/FacialKeyPoints_model_3.keras",
                                 verbose=1, save_best_only=True)
    history3 = model_3_facialKeyPoints.fit(X_train_kp, y_train_kp, batch_size=32,
                                           epochs=3, validation_split=0.05,
                                           callbacks=[checkpoint])
# @title Compiling and training with batch_size = 64, epochs = 100, and decay on plateau of the learning rate
if retrain_model:
    initial_learning_rate = 0.0008
    # compile model
    adam = tf.keras.optimizers.Adam(learning_rate=initial_learning_rate, beta_1=0.9,
                                    beta_2=0.999, amsgrad=False)
    model_1_facialKeyPoints = Model(inputs=X_input, outputs=X)
    model_1_facialKeyPoints.compile(loss="mean_squared_error", optimizer=adam,
                                    metrics=['accuracy'])
    # Callbacks: reduce lr on plateau
    reduce_lr = ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.65,
        patience=5,
        min_lr=1e-8,
        verbose=1
    )
    early = EarlyStopping(
        monitor='val_loss',
        patience=12,
        restore_best_weights=True,
        verbose=1,
        mode='min'
    )
    # Callbacks: save best model
    checkpoint = ModelCheckpoint(
        filepath="Models/FacialKeyPoints_model_1.keras",
        verbose=1,
        save_best_only=True
    )
    # Callbacks: log epoch results to CSV
    csv_logger = CSVLogger(
        'Models/training_history_model_1.csv',
        append=True,   # keep adding if file exists
        separator=','  # comma-separated
    )
    # fit with CSVLogger included
    history = model_1_facialKeyPoints.fit(
        X_train_kp, y_train_kp,
        batch_size=64,
        epochs=100,
        validation_split=0.05,
        callbacks=[checkpoint, reduce_lr, csv_logger, early]
    )
print(X_train_kp.shape) # e.g. (N, 96, 96, 1)
print(y_train_kp.shape) # should print (N, 30)
(8560, 96, 96, 1) (8560, 30)
Assessing the trained key facial points detection model performance¶
# Load the best saved model and recompile it
adam = tf.keras.optimizers.Adam(learning_rate = 0.0001, beta_1 = 0.9, \
beta_2 = 0.999, amsgrad = False)
model_1_facialKeyPoints = tf.keras.models.load_model("Models/FacialKeyPoints_model_1.keras")
model_1_facialKeyPoints.compile(loss = "mean_squared_error", optimizer = adam, \
metrics = ['accuracy'])
# Evaluate the model
# The reference model from the course materials reaches loss 8.3705 and accuracy 0.8528 on the X_test, y_test set.
result = model_1_facialKeyPoints.evaluate(X_test_kp, y_test_kp)
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 35ms/step - accuracy: 0.8072 - loss: 37.0998
67/67 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step
predicted_kp = model_1_facialKeyPoints.predict(X_test_kp)
predicted_kp = pd.DataFrame(predicted_kp, columns=columns)
predicted_kp
| left_eye_center_x | left_eye_center_y | right_eye_center_x | right_eye_center_y | left_eye_inner_corner_x | left_eye_inner_corner_y | left_eye_outer_corner_x | left_eye_outer_corner_y | right_eye_inner_corner_x | right_eye_inner_corner_y | ... | nose_tip_x | nose_tip_y | mouth_left_corner_x | mouth_left_corner_y | mouth_right_corner_x | mouth_right_corner_y | mouth_center_top_lip_x | mouth_center_top_lip_y | mouth_center_bottom_lip_x | mouth_center_bottom_lip_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27.774412 | 41.608402 | 69.701347 | 39.645924 | 36.824764 | 42.768158 | 18.683453 | 43.374584 | 61.305141 | 41.332291 | ... | 51.949940 | 64.388763 | 33.121056 | 88.780327 | 69.142975 | 87.274284 | 51.623299 | 84.660240 | 52.020306 | 94.269218 |
| 1 | 63.867874 | 59.837021 | 29.558840 | 56.851757 | 57.666470 | 58.394218 | 70.214401 | 59.508846 | 35.728714 | 56.771957 | ... | 48.389038 | 41.098076 | 64.282990 | 24.717127 | 35.385838 | 22.573709 | 49.672440 | 26.816153 | 50.343761 | 18.170223 |
| 2 | 66.701538 | 38.380432 | 30.169611 | 36.045486 | 60.238499 | 39.013817 | 73.072281 | 39.661930 | 36.369148 | 37.416374 | ... | 47.273544 | 59.190262 | 60.052547 | 82.078575 | 32.966862 | 80.368378 | 46.807598 | 75.696198 | 46.202122 | 88.636078 |
| 3 | 29.851618 | 40.825123 | 68.809952 | 39.051144 | 37.333893 | 41.556328 | 22.312656 | 42.364140 | 61.609417 | 40.434784 | ... | 49.816231 | 61.445831 | 34.389442 | 85.753059 | 66.853058 | 84.536171 | 50.709244 | 78.521538 | 50.523697 | 94.321213 |
| 4 | 29.966978 | 34.698601 | 66.767433 | 40.271515 | 37.919559 | 37.025856 | 21.968967 | 34.297169 | 59.200417 | 40.350548 | ... | 47.481728 | 60.271687 | 26.436144 | 72.177620 | 59.407753 | 76.562050 | 43.661575 | 75.111107 | 42.907543 | 80.085808 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2135 | 69.228844 | 55.780846 | 44.195251 | 25.819244 | 63.851627 | 50.836746 | 72.786087 | 62.208595 | 48.127544 | 32.177284 | ... | 40.634342 | 54.915756 | 34.938332 | 80.414986 | 15.325096 | 55.738548 | 28.048006 | 64.707710 | 19.336086 | 73.635452 |
| 2136 | 64.919174 | 63.792625 | 55.612881 | 27.871506 | 62.441170 | 57.269558 | 65.713219 | 70.790627 | 56.210602 | 34.523991 | ... | 39.838898 | 51.735191 | 25.973181 | 69.893890 | 18.873905 | 40.596218 | 25.673395 | 54.140991 | 15.005697 | 58.321358 |
| 2137 | 67.449257 | 36.665802 | 29.131422 | 36.421165 | 60.840542 | 36.884842 | 74.299210 | 37.797375 | 35.730961 | 36.823277 | ... | 46.935879 | 49.284603 | 63.658436 | 73.521004 | 32.097244 | 73.253464 | 47.337311 | 65.122261 | 46.956753 | 81.887741 |
| 2138 | 67.602493 | 59.835785 | 52.191608 | 26.726240 | 64.597107 | 54.131611 | 69.025040 | 66.138535 | 53.892822 | 32.844337 | ... | 42.994911 | 50.666443 | 32.735664 | 70.033546 | 21.022322 | 42.965286 | 30.660442 | 54.207699 | 20.245888 | 60.698238 |
| 2139 | 30.817675 | 35.639484 | 64.787453 | 37.358612 | 37.058781 | 37.048462 | 24.390444 | 36.065876 | 58.306847 | 38.046780 | ... | 45.163116 | 57.671455 | 31.936731 | 73.423302 | 60.704609 | 74.940430 | 45.779903 | 71.358398 | 45.488407 | 81.041817 |
2140 rows × 30 columns
# @title Printing out samples of predictions
fig, axes = plt.subplots(4,4, figsize=(10,10))
axes = axes.ravel()
for i in range(16):
    axes[i].imshow(X_test_kp[i].reshape(96, 96), cmap='gray')
    axes[i].axis('off')
    for j in range(1, 31, 2):
        axes[i].plot(predicted_kp.iloc[i, j-1], predicted_kp.iloc[i, j], marker='.', color='r')
#plt.tight_layout()
plt.show()
Part 2. Facial Expression detection¶
In this second part of the project, I train the second model which will classify emotions. The data contains images that belong to 5 categories:
- 0 = Angry
- 1 = Disgust
- 2 = Sad
- 3 = Happy
- 4 = Surprise
The images in the data set are of size 48px * 48px. Therefore they need to be resized to 96px * 96px so that the expression detection model can be run together with the key facial point detection model.
Below is an example of an original image, results from resizing and final image after interpolation.
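For illustration, a 2x nearest-neighbour upscaling from 48 px to 96 px can be sketched in NumPy by repeating each pixel along both axes (the notebook itself presumably uses a library resize with interpolation; this is not the project's actual code):

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling: repeat each pixel along both axes."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

small = np.random.default_rng(0).random((48, 48))
big = upsample2x(small)
print(big.shape)  # (96, 96)
```

Interpolating resizers (bilinear, bicubic) give smoother results than this pixel repetition, which is why the interpolated image in the example below looks less blocky.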
Visualising the images in the dataset with the emotions¶
expression_df.head()
| emotion | pixels | |
|---|---|---|
| 0 | 0 | [[69.316925, 73.03865, 79.13719, 84.17186, 85.... |
| 1 | 0 | [[151.09435, 150.91393, 150.65791, 148.96367, ... |
| 2 | 2 | [[23.061905, 25.50914, 29.47847, 33.99843, 36.... |
| 3 | 2 | [[20.083221, 19.079437, 17.398712, 17.158691, ... |
| 4 | 3 | [[76.26172, 76.54747, 77.001785, 77.7672, 78.4... |
Below are the counts of each emotion category. The data are extremely unbalanced, with very few images portraying disgust and many images in the happy category.
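A common way to compensate for such imbalance is to weight the loss inversely to class frequency. A minimal sketch with illustrative counts (not the actual dataset numbers):

```python
import numpy as np

# Hypothetical per-class counts: class 1 ("Disgust") is rare,
# class 3 ("Happy") is the most common.
counts = np.array([4000, 500, 6000, 9000, 4000])

# Inverse-frequency weighting: total / (n_classes * count per class)
weights = counts.sum() / (len(counts) * counts)
class_weight = {i: round(w, 3) for i, w in enumerate(weights)}
print(class_weight)  # the rarest class gets the largest weight
```

A dictionary like this could be passed as the `class_weight` argument of Keras's `fit`, so that mistakes on rare classes cost more during training.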
Data preparation and image augmentation¶
X shape (24568, 96, 96, 1) y shape (24568, 5) X train shape (22111, 96, 96, 1) y train shape (22111, 5) X val shape (1228, 96, 96, 1) y val shape (1228, 5) X test shape (1229, 96, 96, 1) y test shape (1229, 5)
Data preprocessing¶
In the data preprocessing I will again normalize the data and perform image augmentation, as was done in the Part 1. of the project.
First, I normalize the data to contain values between 0 and 1. Then, I use the following image augmentation techniques:
- rotating up to 15 degrees
- shifting the image horizontally up to 0.1*image width
- shifting the image vertically up to 0.1*image height
- shearing the image up to 0.1
- zooming the image up to 10 %
- horizontally flipping the image
- vertically flipping the image
- adjusting the brightness
The spaces outside the image boundaries are filled by replicating the nearest pixels.
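In Keras, the augmentation steps listed above would typically be configured through an ImageDataGenerator along these lines (a sketch of the described setup; the brightness range is an assumed example value, as the original text does not give one):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,            # rotate up to 15 degrees
    width_shift_range=0.1,        # horizontal shift up to 0.1 * image width
    height_shift_range=0.1,       # vertical shift up to 0.1 * image height
    shear_range=0.1,              # shear up to 0.1
    zoom_range=0.1,               # zoom up to 10 %
    horizontal_flip=True,         # horizontal flip
    vertical_flip=True,           # vertical flip
    brightness_range=(0.8, 1.2),  # brightness adjustment (assumed range)
    fill_mode="nearest",          # replicate nearest pixels outside boundaries
)
```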
Build and train Deep Learning model for facial expression classification¶
The model I will build has the following architecture:
# @title Emotion recognition model
input_shape = (96,96,1)
# Input tensor shape
X_input = Input(input_shape)
# Zero-padding
X = ZeroPadding2D((3,3))(X_input)
# Stage 1
X = Conv2D(64, (7,7), strides = (2,2), name = 'conv1', kernel_initializer=glorot_uniform(seed=0))(X)
X = BatchNormalization(axis = 3, name = 'bn1')(X)
X = Activation('relu')(X)
X = MaxPooling2D((3,3), strides = (2,2))(X)
# Stage 2
X = res_block(X, filter = [64,64,256], stage = 'res2')
# Stage 3
X = res_block(X, filter = [128,128,512], stage = 'res3')
# Stage 4 (optional)
#X = res_block(X, filter= [256,256,1024], stage = 'res4')
# Average pooling
X = AveragePooling2D((4,4), name = 'avg_pool')(X)
# Final layer
X = Flatten()(X)
X = Dense(5, activation = 'softmax', name = 'dense', kernel_initializer=glorot_uniform(seed=0))(X)
Emotion_det_model_2 = Model(inputs = X_input, outputs = X, name = 'Resnet18')
Model: "Resnet18" (Keras summary, truncated): input (None, 96, 96, 1) → ZeroPadding2D(3, 3) → conv1 Conv2D(64, 7×7, stride 2) → bn1 → ReLU → MaxPooling2D(3×3, stride 2), followed by the res2 stage (one convolutional block plus two identity blocks, filters 64, 64, 256) and the res3 stage (one convolutional block plus two identity blocks, filters 128, 128, 512), each block repeating the Conv2D → BatchNormalization → Activation pattern with an Add skip connection. The network ends with avg_pool (AveragePooling2D, output (None, 1, 1, 512)) → flatten_1 (512) → dense (Dense, 5 outputs).
Total params: 1,174,021 (4.48 MB) Trainable params: 1,165,445 (4.45 MB) Non-trainable params: 8,576 (33.50 KB)
print(f"Training samples: {len(X_train_ed)}")
print(f"Batch size: {64}")
steps_per_epoch=np.ceil(len(X_train_ed) / 64).astype(int)
print(f"Steps per epoch: {steps_per_epoch}")
Training samples: 22111 Batch size: 64 Steps per epoch: 346
Evaluate model¶
Confusion matrix, accuracy, precision, and recall
39/39 ━━━━━━━━━━━━━━━━━━━━ 5s 51ms/step - accuracy: 0.7534 - loss: 0.5960
39/39 ━━━━━━━━━━━━━━━━━━━━ 4s 46ms/step
print(classification_report(true_classes, predicted_classes))
precision recall f1-score support
0 0.68 0.65 0.66 245
1 0.46 0.27 0.34 22
2 0.62 0.72 0.67 319
3 0.86 0.84 0.85 458
4 0.87 0.77 0.81 185
accuracy 0.75 1229
macro avg 0.70 0.65 0.67 1229
weighted avg 0.76 0.75 0.75 1229
The above table tells us that the classes where we had the least data (see the support column) have the weakest performance. Precision (the percentage of samples predicted to be class x that are actually x) and recall (the percentage of class-x samples in the data that are correctly labeled as x) are highest for class 3, where we also had the most samples. The F1-score is the harmonic mean of precision and recall and is calculated as
$F_1 = 2 \ \frac{\text{precision} \ \times \ \text{recall}}{\text{precision} \ +\ \text{recall}}$
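As a quick sanity check, the harmonic mean of the reported class 0 precision (0.68) and recall (0.65) reproduces the reported F1 of 0.66:

```python
def f1_score(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.68, 0.65), 2))  # 0.66, matching class 0 in the report
print(round(f1_score(0.86, 0.84), 2))  # 0.85, matching class 3
```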
Part 3. Combining the key point detection and facial expression recognition models¶
df_predict = predict(X_test_ed)
39/39 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step 39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
df_predict.head()
| left_eye_center_x | left_eye_center_y | right_eye_center_x | right_eye_center_y | left_eye_inner_corner_x | left_eye_inner_corner_y | left_eye_outer_corner_x | left_eye_outer_corner_y | right_eye_inner_corner_x | right_eye_inner_corner_y | ... | nose_tip_y | mouth_left_corner_x | mouth_left_corner_y | mouth_right_corner_x | mouth_right_corner_y | mouth_center_top_lip_x | mouth_center_top_lip_y | mouth_center_bottom_lip_x | mouth_center_bottom_lip_y | emotion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66.610451 | 40.496799 | 36.849854 | 25.179127 | 59.685787 | 38.832348 | 72.507202 | 43.841587 | 42.234669 | 28.970413 | ... | 51.316212 | 49.191910 | 70.809021 | 23.674337 | 58.018814 | 36.823456 | 64.206169 | 32.734856 | 67.690872 | 3 |
| 1 | 64.653389 | 38.042404 | 29.781794 | 34.580994 | 57.754307 | 38.556492 | 71.875885 | 39.505543 | 36.547447 | 36.649067 | ... | 59.056892 | 58.322350 | 77.191147 | 28.911465 | 74.403419 | 43.987125 | 74.182167 | 43.264114 | 80.884209 | 0 |
| 2 | 60.960011 | 37.275005 | 33.889050 | 33.123466 | 55.079693 | 37.744541 | 66.276917 | 38.548401 | 38.508617 | 35.537212 | ... | 59.973900 | 51.810165 | 79.506187 | 30.799669 | 76.300369 | 40.469513 | 78.190208 | 40.058804 | 79.434502 | 2 |
| 3 | 56.973568 | 37.822292 | 25.538334 | 38.267265 | 50.197941 | 38.189533 | 64.570465 | 37.789585 | 32.019379 | 38.396927 | ... | 44.175758 | 59.712654 | 48.962036 | 30.902287 | 48.933670 | 44.473202 | 48.845013 | 45.226276 | 49.808178 | 3 |
| 4 | 62.833557 | 40.498875 | 31.723820 | 37.697025 | 56.710606 | 41.063759 | 69.142136 | 41.633850 | 37.679867 | 39.783356 | ... | 58.978386 | 57.483547 | 75.107430 | 30.617802 | 73.028336 | 44.611389 | 72.902473 | 44.198177 | 77.997261 | 0 |
5 rows × 31 columns
Plotting test images of the combined models.